Scaling Up a Distributed Computing Of Similarity Coefficient with Mapreduce

Authors

  • Mirel Cosulschi
  • Mihai Gabroveanu
  • Florin Slabu
  • Adriana Sbircea
Abstract

This paper addresses the design and implementation of a Hadoop application, and the experiments performed with it, to compute Jaccard similarity metrics for two very large graphs. The algorithm uses the MapReduce programming model, which distributes the computation over several machines in order to reduce the overall running time. As a distributed programming model, MapReduce is one of the most important techniques behind the cloud computing metaphor, and it focuses on data-intensive computing in clustered environments. The open source Hadoop framework provides developers with a Java API for implementing applications based on the MapReduce programming paradigm. In this paradigm, the main task is divided into several smaller subtasks that can be executed or re-executed on any node in the cluster. The experimental results presented in this paper were obtained from various tests over two large data sets (WEBSPAM-UK 2007 and Slashdot) on a distributed cluster.
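The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a Jaccard coefficient between node neighbourhoods, J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|, could be expressed as two map/reduce rounds. It is written in plain Python and simulates the rounds in memory rather than using the Hadoop Java API. All names (`run_mapreduce`, `edge_mapper`, `pair_mapper`, and so on) are hypothetical, the graph is assumed to be undirected, and the in-memory `degree` lookup stands in for what would be a join or side input on a real cluster.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce round in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Round 1: turn an undirected edge list into adjacency sets.
def edge_mapper(edge):
    u, v = edge
    yield u, v
    yield v, u


def adjacency_reducer(node, neighbours):
    yield node, frozenset(neighbours)


# Round 2: every node "votes" for each pair of its neighbours,
# because it is a common neighbour of that pair.
def pair_mapper(record):
    _, neighbours = record
    for u, v in combinations(sorted(neighbours), 2):
        yield (u, v), 1


def count_reducer(pair, ones):
    yield pair, sum(ones)


def jaccard_by_pair(edges):
    """Return {(u, v): J(u, v)} for pairs sharing at least one neighbour."""
    adjacency = run_mapreduce(edges, edge_mapper, adjacency_reducer)
    # Assumption: degrees fit in memory here; on a cluster this is a join.
    degree = {node: len(neighbours) for node, neighbours in adjacency}
    common = run_mapreduce(adjacency, pair_mapper, count_reducer)
    return {
        (u, v): shared / (degree[u] + degree[v] - shared)
        for (u, v), shared in common
    }


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
scores = jaccard_by_pair(edges)
```

Pairs with no common neighbour, such as `("c", "d")` in this toy graph, never receive a vote and are therefore absent from `scores`. The union size is obtained as `degree[u] + degree[v] - shared`, which avoids shipping both full neighbour sets to the same reducer.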

Similar resources

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system's rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...

Large Scale Machine Translation Architecture

Parallelization is widely considered to be the future of high performance computation, and is a natural choice when scaling up machine translation systems. In this report, a programming model called MapReduce is investigated and two supporting components that the MapReduce framework needs to work efficiently are analyzed, namely the distributed storage for streaming data and distributed storage for st...

ClusterJoin: A Similarity Joins Framework using Map-Reduce

Similarity join is the problem of finding pairs of records with similarity score greater than some threshold. In this paper we study the problem of scaling up similarity join for different metric distance functions using MapReduce. We propose a ClusterJoin framework that partitions the data space based on the underlying data distribution, and distributes each record to partitions in which they ...

Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

The kernel k-means is an effective method for data clustering which extends the commonly-used k-means algorithm to work on a similarity matrix over complex data structures. It is, however, computationally very complex as it requires the complete kernel matrix to be calculated and stored. Further, its kernelized nature hinders the parallelization of its computations on modern scalable infrastruc...

Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming

The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations, because the rapid development in the use of information technology in general, and network technology in particular, has led many organizations to make their applications available for use via electronic platforms hos...

Journal title:
  • IJCSA

Volume 12  Issue 

Pages  -

Publication date 2015